This post tries out a simple web crawler using Python's Beautiful Soup module.
Install beautifulsoup4:
pip3 install beautifulsoup4
We will scrape the movie rankings from the Yahoo Movies chart page. First we need to fetch the page content; the ssl module is needed here, otherwise urlopen raises a CERTIFICATE_VERIFY_FAILED error.
>>> import ssl
>>> from urllib import request
>>> context = ssl._create_unverified_context()
>>> req_obj = request.Request('https://movies.yahoo.com.tw/chart.html')
>>> with request.urlopen(req_obj, context=context) as res_obj:
...     print(res_obj.read())
...
b'<!DOCTYPE html>\n<html lang="en">\n<head>\n  <meta charset="UTF-8">\n  <meta name="viewport" content="width=device-width, initial-scale=1, user-minimum-scale=1, maximum-scale=1">\n  <meta http-equiv="content-type" content="text/html; charset=utf-8">\n  <meta property="fb:app_id" content="501887343352051">\n  <meta property="og:site_name" content="Yahoo\xe5\xa5\x87\xe6\x91\xa9\xe9\x9b\xbb\xe5\xbd\xb1">\n    <title>\xe5\x8f\xb0\xe5\x8c\x97\xe7\xa5\xa8\xe6\x88\xbf\xe6\xa6\x9c
......
Parse the fetched page content with html.parser; soup.prettify() can be used to inspect it.
>>> from bs4 import BeautifulSoup
>>> with request.urlopen(req_obj,context=context) as res_obj:
...  resp = res_obj.read().decode('utf-8')
...  soup = BeautifulSoup(resp , 'html.parser')
...  print(soup.prettify())
...
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, user-minimum-scale=1, maximum-scale=1" name="viewport"/>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="501887343352051" property="fb:app_id"/>
  <meta content="Yahoo奇摩電影" property="og:site_name"/>
  <title>
   台北票房榜 - Yahoo奇摩電影
  </title>
  ...
Next, locate the page block that holds the content we want to scrape. The movie ranking block is wrapped in <div class="rank_list table rankstyle1">.
<div class="rank_list table rankstyle1">
    <div class="tr top">
      <div class="td">本週</div>
      <div class="td updown"></div>
      <div class="td">上週</div>
      <div class="td">片名</div>
      <div class="td">上映日期</div>
      <div class="td">預告片</div>
      <div class="td">網友滿意度</div>
    </div>
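Locating that block by class can be sketched as below; this is a minimal, self-contained example using a trimmed-down stand-in for the real page's markup, not the full chart page:

```python
from bs4 import BeautifulSoup

# A trimmed-down stand-in for the ranking block on the real page
html = '''
<div class="rank_list table rankstyle1">
  <div class="tr top">
    <div class="td">本週</div>
    <div class="td">上週</div>
  </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# class_ matches if the tag carries that CSS class among its classes,
# so 'rank_list' alone is enough to find the block
block = soup.find('div', class_='rank_list')
print(block['class'])   # ['rank_list', 'table', 'rankstyle1']
print(block.find('div', class_='td').string)
```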
        
The complete crawler, crawler.py:
import ssl
from urllib import request
from bs4 import BeautifulSoup

# Skip certificate verification to avoid CERTIFICATE_VERIFY_FAILED
context = ssl._create_unverified_context()
req_obj = request.Request('https://movies.yahoo.com.tw/chart.html')

with request.urlopen(req_obj, context=context) as res_obj:
    resp = res_obj.read().decode('utf-8')
    soup = BeautifulSoup(resp, 'html.parser')
    # every row of the ranking table is a <div class="tr">
    rows = soup.find_all('div', class_='tr')
    # the first row holds the column headers
    colname = list(rows.pop(0).stripped_strings)
    contents = []
    for row in rows:
        # the first <div class="td"> in a row is this week's rank
        thisweek_rank = row.find_next('div', attrs={'class': 'td'})
        updown = thisweek_rank.find_next('div')
        lastweek_rank = updown.find_next('div')
        # the top-ranked movie's title sits in an <h2>;
        # the rest use <div class="rank_txt">
        if thisweek_rank.string == '1':
            movie_title = lastweek_rank.find_next('h2')
        else:
            movie_title = lastweek_rank.find_next('div', attrs={'class': 'rank_txt'})
        release_date = movie_title.find_next('div', attrs={'class': 'td'})
        trailer = release_date.find_next('div', attrs={'class': 'td'})
        # not every movie has a trailer link
        if trailer.find('a') is None:
            trailer_address = ''
        else:
            trailer_address = trailer.find('a')['href']
        stars = row.find('h6', attrs={'class': 'count'})
        lastweek_rank = lastweek_rank.string if lastweek_rank.string else ''
        c = [thisweek_rank.string, lastweek_rank, movie_title.string,
             release_date.string, trailer_address, stars.string]
        contents.append(c)
print(contents)
Run crawler.py:
> python3 crawler.py
[['1', '1', '返校', '2019-09-20', 'https://movies.yahoo.com.tw/video/%E8%BF%94%E6%A0%A1-400%E7%A7%92%E5%B8%B6%E4%BD%A0%E5%9B%9E%E9%A1%A7%E9%9B%BB%E5%BD%B1%E5%8E%9F%E5%9E%8B%E6%95%85%E4%BA%8B-xxy-111923492.html', '4.3'], ['2', '2', '天氣之子', '2019-09-12', 'https://movies.yahoo.com.tw/video/%E7%84%A1%E9%9B%B7%E5%BD%B1%E8%A9%95-%E5%A4%A9%E6%B0%A3%E4%B9%8B%E5%AD%90-%E8%A8%BB%E5%AE%9A%E8%A9%95%E5%83%B9%E5%85%A9%E6%A5%B5%E7%9A%84%E5%8B%95%E7%95%AB%E9%9B%BB%E5%BD%B1-xxy%E8%A9%95%E9%9B%BB%E5%BD%B1-030333793.html', '4.3'], ['3', '3', '星際救援', '2019-09-20', 'https://movies.yahoo.com.tw/video/%E6%98%9F%E9%9A%9B%E6%95%91%E6%8F%B4-%E8%AA%B0%E6%89%8D%E6%98%AF%E5%AE%8C%E7%BE%8E%E5%A4%AA%E7%A9%BA%E4%BA%BA-xxy%E8%A9%95%E9%9B%BB%E5%BD%B1-043512139.html', '3.8'], ['4', '', '青春豬頭少年不會夢到懷夢美少女', '2019-09-27', '', '4.5'], ['5', '', '無間行動', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E7%84%A1%E9%96%93%E8%A1%8C%E5%8B%95-%E5%85%A8%E9%9D%A2%E9%80%83%E6%AE%BA%E7%89%88%E9%A0%90%E5%91%8A-025134973.html', '4.1'], ['6', '5', '全面攻佔3: 天使救援', '2019-08-21', 'https://movies.yahoo.com.tw/video/%E5%85%A8%E9%9D%A2%E6%94%BB%E4%BD%943-%E5%A4%A9%E4%BD%BF%E6%95%91%E6%8F%B4-%E8%8B%B1%E9%9B%84%E5%88%B0%E5%BA%95%E9%80%80%E4%B8%8D%E9%80%80%E5%A0%B4-xxy%E8%A9%95%E9%9B%BB%E5%BD%B1-034051084.html', '4.2'], ['7', '4', '牠 第二章', '2019-09-05', 'https://movies.yahoo.com.tw/video/%E7%89%A0-%E7%AC%AC%E4%BA%8C%E7%AB%A0-%E8%A7%A3%E6%9E%90-%E8%A2%AB%E7%BE%8E%E8%B2%8C%E8%A9%9B%E5%92%92%E7%9A%84%E8%B2%9D%E8%8A%99%E8%8E%89%E9%A6%AC%E8%A8%B1-160000560.html', '4'], ['8', '', '瞞天機密', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E7%9E%9E%E5%A4%A9%E6%A9%9F%E5%AF%86-%E5%8B%87%E6%B0%A3%E7%89%88%E9%A0%90%E5%91%8A-084815060.html', '4.1'], ['9', '', '信用詐欺師JP', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E4%BF%A1%E7%94%A8%E8%A9%90%E6%AC%BA%E5%B8%ABjp-%E4%B8%AD%E6%96%87%E9%A0%90%E5%91%8A-062304730.html', '4'], ['10', '', '囧媽的極地任務', '2019-09-27', 
'https://movies.yahoo.com.tw/video/%E5%9B%A7%E5%AA%BD%E7%9A%84%E6%A5%B5%E5%9C%B0%E4%BB%BB%E5%8B%99-%E4%B8%AD%E6%96%87%E9%A0%90%E5%91%8A-025032372.html', '4.2'], ['11', '', '校外打怪教學', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E6%A0%A1%E5%A4%96%E6%89%93%E6%80%AA%E6%95%99%E5%AD%B8-%E6%AD%A3%E5%BC%8F%E9%A0%90%E5%91%8A-062837459.html', '3.7'], ['12', '10', '普羅米亞', '2019-08-16', 'https://movies.yahoo.com.tw/video/%E6%99%AE%E7%BE%85%E7%B1%B3%E4%BA%9E-%E4%B8%AD%E6%96%87%E9%A0%90%E5%91%8A-144302686.html', '3.8'], ['13', '', '變身', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E8%AE%8A%E8%BA%AB-%E4%B8%AD%E6%96%87%E9%A0%90%E5%91%8A-084131268.html', '3.8'], ['14', '', '笑笑羊大電影:外星人來了', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E7%AC%91%E7%AC%91%E7%BE%8A%E5%A4%A7%E9%9B%BB%E5%BD%B1-%E5%A4%96%E6%98%9F%E4%BA%BA%E4%BE%86%E4%BA%86-%E4%B8%AD%E6%96%87%E9%85%8D%E9%9F%B3%E6%AD%A3%E5%BC%8F%E9%A0%90%E5%91%8A-030458730.html', '4'], ['15', '8', '唐頓莊園', '2019-09-20', 'https://movies.yahoo.com.tw/video/%E5%94%90%E9%A0%93%E8%8E%8A%E5%9C%92-%E5%9B%9E%E9%A1%A7%E7%AF%87-044725185.html', '4.1'], ['16', '7', '極限逃生', '2019-08-30', 'https://movies.yahoo.com.tw/video/%E6%A5%B5%E9%99%90%E9%80%83%E7%94%9F-%E4%B8%AD%E6%96%87%E9%A0%90%E5%91%8A-134635519.html', '4.1'], ['17', '6', '第九分局', '2019-08-29', 'https://movies.yahoo.com.tw/video/%E7%AC%AC%E4%B9%9D%E5%88%86%E5%B1%80-%E5%8B%95%E4%BD%9C-%E7%89%B9%E6%95%88%E8%88%87%E5%8C%96%E5%A6%9D%E7%AF%87-130453384.html', '3.9'], ['18', '', '雪地之光', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E9%9B%AA%E5%9C%B0%E4%B9%8B%E5%85%89-%E6%AD%A3%E5%BC%8F%E9%A0%90%E5%91%8A-033605254.html', '3.6'], ['19', '12', '殺手餐廳', '2019-09-20', 'https://movies.yahoo.com.tw/video/%E6%AE%BA%E6%89%8B%E9%A4%90%E5%BB%B3-%E8%9C%B7%E5%B7%9D%E5%AF%A6%E8%8A%B1%E5%B0%8E%E6%BC%94%E7%AF%87-065439673.html', '3.9'], ['20', '9', '好小男孩', '2019-09-12', 
'https://movies.yahoo.com.tw/video/%E5%A5%BD%E5%B0%8F%E7%94%B7%E5%AD%A9-%E5%B9%95%E5%BE%8C%E8%8A%B1%E7%B5%AE%E7%AF%87-122756018.html', '3.5']]
As practice, let's scrape the top songs from the Holiday KTV (好樂迪) website. On the official site, use view-source to find the ranking's HTML block, shown below. Note that the first <tr> and the last <tr> of the table are not song entries, so they will need to be filtered out later.
<table cellspacing="0" cellpadding="4" rules="all" border="1" id="ctl00_ContentPlaceHolder1_dgSong"
       style="background-color:White;border-color:White;border-width:1px;border-style:solid;width:100%;border-collapse:collapse;">
    <tbody>
        <tr align="center" valign="middle" style="color:White;background-color:Black;">
            <td>本週</td>
            <td>上週</td>
            <td>週數</td>
            <td align="center" valign="middle">點歌<br>曲號
            </td>
            <td align="center" valign="middle" style="width:34%;">歌名</td>
            <td align="center" valign="middle">歌手</td>
        </tr>
        <tr align="center" valign="middle" style="background-color:#EAEAEA;">
            <td style="background-color:#666666;">
                <font size="4">
                    <span id="ctl00_ContentPlaceHolder1_dgSong_ctl03_lbThisWeek"
                          style="color:White;font-family:Geneva,Arial,Helvetica,sans-serif;font-weight:bold;">1
                    </span>
                </font>
            </td>
            <td>
                <font size="4">
                    <span id="ctl00_ContentPlaceHolder1_dgSong_ctl03_lbLastWeek"
                          style="color:#333333;font-family:Geneva,Arial,Helvetica,sans-serif;font-weight:bold;">4
                    </span>
                </font>
            </td>
            <td>
                <font size="4">
                    <span id="ctl00_ContentPlaceHolder1_dgSong_ctl03_lbWeeks"
                          style="color:#999999;font-family:Geneva,Arial,Helvetica,sans-serif;font-weight:bold;">3
                    </span>
                </font>
            </td>
            <td align="center" valign="middle" style="font-family:Geneva,Arial,Helvetica,sans-serif;font-weight:bold;">
                26009
            </td>
            <td align="center" valign="middle">來個蹦蹦</td>
            <td align="center" valign="middle">
                <a href="#" onclick="javascript:GoSearch("玖壹壹.Ella(陳嘉樺)                    ");">
                    玖壹壹.Ella(陳嘉樺)
                </a>
            </td>
        </tr>
        <tr align="center" valign="middle" style="background-color:#CCCCCC;">
            ......
        </tr>
        
        <tr align="center" style="font-weight:bold;text-decoration:none;width:100%;">
            <td colspan="6"><span>1</span> <a
                    href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgSong$ctl24$ctl03','')">2</a> <a
                    href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgSong$ctl24$ctl04','')">3
            </a>
                <a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgSong$ctl24$ctl01','')">下一頁</a>
            </td>
        </tr>
    </tbody>
</table>
Create a crawler called holiday.py. First fetch the page content, then build a BeautifulSoup object; its API is used to pick out the values of the elements we need. find_all returns every element that matches the given conditions, while find returns only the first match. The script also uses the pandas module; it is used directly here without introduction, and will be covered in a separate post.
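The difference between find and find_all can be illustrated with a tiny self-contained snippet (the <ul> here is just a made-up example, not from either site):

```python
from bs4 import BeautifulSoup

html = '<ul><li>a</li><li>b</li><li>c</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

all_items = soup.find_all('li')   # every matching tag, as a list
first_item = soup.find('li')      # only the first match (or None)

print([li.string for li in all_items])  # ['a', 'b', 'c']
print(first_item.string)                # a
```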
import ssl
from urllib import request
from bs4 import BeautifulSoup
import pandas as pd

# Skip certificate verification to avoid CERTIFICATE_VERIFY_FAILED
context = ssl._create_unverified_context()
# Build a Request for the Holiday KTV chart page
req_obj = request.Request('https://www.holiday.com.tw/song/Billboard.aspx')
song_list = []
# Send the request
with request.urlopen(req_obj, context=context) as res_obj:
    # Read the response and decode it as UTF-8
    resp = res_obj.read().decode('utf-8')
    # Parse with html.parser
    soup = BeautifulSoup(resp, 'html.parser')
    # find the <table> whose id is ctl00_ContentPlaceHolder1_dgSong
    # and return all of its <tr> rows
    rank_table = soup.find('table', id='ctl00_ContentPlaceHolder1_dgSong').find_all('tr')
    # Skip the header row at the top and the trailing pager rows,
    # hence the slice [1:-2]
    for rt in rank_table[1:-2]:
        # the song name is in the fifth <td> (index 4)
        song_name = rt.find_all('td')[4]
        # only the singer is wrapped in an <a> tag, so find the first one
        singer = rt.find('a')
        # strip surrounding whitespace and collect the song/singer pair
        song_list.append([song_name.string.strip(), singer.string.strip()])
# Turn song_list into a pandas DataFrame for later analysis
df = pd.DataFrame(song_list, columns=['song', 'singer'])
print(df)
Execution result:
> python3 holiday.py
          song         singer
0         來個蹦蹦  玖壹壹.Ella(陳嘉樺)
1           過客            莊心妍
2         I Go            周湯豪
3           走心            賀敬軒
4      多想留在你身邊            劉增瞳
5       終於了解自由            周興哲
6   沒有你陪伴真的好孤單             夢然
7       此刻你聽好了            劉嘉亮
8      說一句我不走了            林芯儀
9   Be Alright         高爾宣OSN
10        可不可以            季彥霖
11      至少我還記得            周興哲
12          預謀            許佳慧
13        知否知否         胡夏.郁可唯
14         太空人            吳青峰
15      重感情的廢物          TRASH
16          何妨         家家.茄子蛋
17          太空            吳青峰
18         兩秒終            周湯豪
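Since the scraped rows end up in a DataFrame, a small sketch of what the later analysis step might look like follows; the two sample rows and the songs.csv filename are made-up examples, not part of the crawler above:

```python
import pandas as pd

# A tiny stand-in for the scraped song_list
song_list = [['來個蹦蹦', '玖壹壹.Ella(陳嘉樺)'], ['過客', '莊心妍']]
df = pd.DataFrame(song_list, columns=['song', 'singer'])

# Save for later analysis; songs.csv is an arbitrary example path
df.to_csv('songs.csv', index=False, encoding='utf-8')

# Count how many charting songs each singer has
print(df['singer'].value_counts())
```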